George Savva (Quadram Institute Bioscience)
Start RStudio.
Make sure you have these packages installed:
library(UsingR) # dataset
library(readxl) # reading data from excel files
library(equatiomatic) # making equations from models
library(emmeans) # marginal means
library(broom) # enhancements to model
library(performance) # model diagnostics
library(ggplot2) # making graphs
library(gamlss) # enhanced modelling
library(car) # hypothesis tests
library(ggbeeswarm) # beeswarm graphs
library(patchwork) # place plots side-by-side

Data:

- diversity.csv
- survey.csv
- observations.csv

Code, slides and handouts:

- presentation.qmd
- presentation.html
- handout.html
- linearmodels.pptx

\[\text{Height}=a+b\times \text{Age}\]
Describes a linear relationship between an outcome and its predictors.
What is the model for?
Who is interested and what will it be used for?
\[ \operatorname{height} = \alpha + \beta_{1}(\operatorname{age}) + \epsilon \]
predict(model1, newdata = newages)
              age
1 27.87171 12
2 30.75998 24
3 33.64825 36
4 36.53652 48
5 39.42478 60
6 42.31305 72
7 45.20132 84
8 48.08959 96
9 50.97786 108
10 53.86613 120
11 56.75440 132
12 59.64267 144
13 62.53094 156
14 65.41921 168
15 68.30748 180
16 71.19574 192
17 74.08401 204
18 76.97228 216
19 79.86055 228
20 82.74882 240
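The table above can be reproduced with a sketch like the following (assuming, as the fitted equation suggests, that `model1` regresses height on age in the `kid.weights` data, with ages in months):

```r
library(UsingR)  # provides the kid.weights dataset

# Sketch: refit the height-on-age model (ages are in months)
model1 <- lm(height ~ age, data = kid.weights)

# Predict heights at yearly intervals from 1 to 20 years
newages <- data.frame(age = seq(12, 240, by = 12))
predict(model1, newdata = newages)
```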
\[ \operatorname{height} = \alpha + \beta_{1}(\operatorname{age}) + \epsilon \]
\[ \operatorname{\widehat{height}} = 24.98 + 0.24(\operatorname{age}) \]
Standard deviation of ‘error’ = 4.77
\[\epsilon \sim N(0,4.77^2)\]
A linear model describes the linear relationship between an outcome and predictors
Can be used for prediction or inference
Simple linear models are fit in R using lm.
broom::augment makes it easy to get predictions.
Visualisations allow us to check model fit.
Confidence intervals reflect uncertainty around means
Prediction intervals reflect likely variation in new points
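Putting those pieces together, a minimal sketch of the workflow (using the `kid.weights` data from UsingR):

```r
library(UsingR)  # kid.weights dataset
library(broom)   # augment()

model1 <- lm(height ~ age, data = kid.weights)

# Fitted values with 95% prediction intervals for each observation
augment(model1, interval = "prediction") |> head()

# Confidence interval around the mean height at age 60 months, for comparison
predict(model1, newdata = data.frame(age = 60), interval = "confidence")
```

Note that the prediction interval is always wider than the confidence interval: it covers new individual observations, not just the mean.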
Can we apply the same method to this dataset?
Can you foresee a problem?
What are the limitations of the ‘regular’ linear model?
model2 <- lm( weight ~ age , data = kid.weights )
model2 |> augment(interval="prediction") |>
ggplot(aes(x=age,y=weight))+theme_bw(base_size = 18)+
geom_ribbon(aes(y=.fitted, ymin=.lower, ymax=.upper),alpha=0.2,fill="blue") +
geom_line( aes(y=.fitted)) + geom_point() +
labs(x="Age (months)",y="Weight (lbs)")

We can model how the error term changes with age.
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = weight ~ age, sigma.formula = ~age,
data = kid.weights, control = gamlss.control(trace = F))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.9963 0.4337 32.27 <2e-16 ***
age 0.4910 0.0160 30.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.087779 0.067469 16.12 <2e-16 ***
age 0.017055 0.001047 16.28 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
No. of observations in the fit: 250
Degrees of Freedom for the fit: 4
Residual Deg. of Freedom: 246
at cycle: 3
Global Deviance: 1662.226
AIC: 1670.226
SBC: 1684.312
******************************************************************
We can also model how the skew changes with age!
******************************************************************
Family: c("BCPE", "Box-Cox Power Exponential")
Call: gamlss(formula = weight ~ age, sigma.formula = ~age,
nu.formula = ~age, family = "BCPE", data = kid.weights,
control = gamlss.control(trace = F))
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.41228 0.43597 30.77 <2e-16 ***
age 0.48394 0.01593 30.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.765940 0.065727 -26.868 <2e-16 ***
age 0.002114 0.001097 1.928 0.055 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Nu link function: identity
Nu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.389413 0.431880 -0.902 0.368
age -0.007446 0.006028 -1.235 0.218
------------------------------------------------------------------
Tau link function: log
Tau Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8196 0.1432 5.722 3.09e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
No. of observations in the fit: 250
Degrees of Freedom for the fit: 7
Residual Deg. of Freedom: 243
at cycle: 7
Global Deviance: 1624.405
AIC: 1638.405
SBC: 1663.056
******************************************************************
We can model things other than the mean!
In the old days (when I was training) this would have been very difficult
Now gamlss and Bayesian analysis make it easy
How good does a model need to be?
Start to understand the ‘modelling’ approach to statistics
Enable you to explore this extensive framework to do almost any kind of statistical / ML analysis
Get familiar with how to do this in R
Become a statistician?
Statistician at Quadram Institute Bioscience since 2017
Support for epidemiology / genomics / ecology / clinical trials / pre-clinical research
Suppose we are interested in understanding the typical heights of UEA students.
How should we go about this?
Can you frame the question in mathematical / statistical terms?
Setting the question
How should you set the scope of a statistical problem?
What is the target population?
What is the sampling frame?
What is the sample?
Students’ heights are not all the same; they have a ‘distribution’.
What kind of data is a ‘height’?
There is an average (central tendency) and a variance (spread).
We also have to assume or determine the shape of the distribution.
As a baseline - assume continuous variables are independently normally distributed
So, for example:
“Heights are normally distributed with mean 180cm and standard deviation 10cm”
We can represent this visually
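For example, the claimed N(180, 10²) distribution can be drawn with ggplot2 (a sketch; the axis limits are arbitrary):

```r
library(ggplot2)

# Density of N(mean = 180, sd = 10) over a plausible range of heights
ggplot(data.frame(h = c(140, 220)), aes(h)) +
  stat_function(fun = dnorm, args = list(mean = 180, sd = 10)) +
  labs(x = "Height (cm)", y = "Density")
```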
Is it likely to be good enough to describe the distribution of student heights?
Models
What is a model?
\[\text{Height} \sim N(\text{mean} , \text{variance})\]
\[H \sim N(\mu , \sigma^2)\]
We describe how individual heights arise like this:
\[h_i = \mu + \epsilon_i\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2)\]
The model has a structural part and an error part
This is the simplest linear model.
Here’s our model for heights:
\[ \boxed{h_i = \mu + \epsilon_i\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2)} \]
A model is a mathematical or statistical way of writing down how we think our outcome is distributed
The model needs to be good enough for the purpose!
It needs to link the aspects of the process we are interested in (true but unknown mean \(\mu\) and variance \(\sigma^2\)) to the data that we can observe (\(h_i\)).
Yes! Whenever you’re doing any kind of statistics there is an underlying model.
An implicit understanding of how your data came to be
Ad hoc statistical procedures sometimes hide these assumptions
But with linear modelling they are more explicit.
\[h_i = \mu + \epsilon_i\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2)\]
Key insight!
Once we have a model, all of statistical analysis is just estimating model parameters!
We can use data to estimate the parameters
If we get enough \(h_i\)’s (at least 2!) we can estimate the values of \(\mu\) and \(\sigma\).
Several ways these models are estimated:
(Not getting into this; the computer will do it)
Note
We can use this logic to show that the sample mean is the best estimate of \(\mu\)
Enter some data and check it’s OK:
height
1 173
2 177
3 160
4 165
5 172
6 182
Estimate the model:
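The intercept-only model uses the formula `height ~ 1` (a sketch using the first few heights shown above; the full dataset has a few more rows):

```r
# Sketch data: the first few heights entered above
dat <- data.frame(height = c(173, 177, 160, 165, 172, 182))

# "~ 1" means intercept only: estimate a single mean plus residual spread
model <- lm(height ~ 1, data = dat)
coef(model)   # the intercept is exactly the sample mean
sigma(model)  # the residual standard error estimates sigma
```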
Display the model summary:
Call:
lm(formula = height ~ 1, data = dat)
Residuals:
Min 1Q Median 3Q Max
-12.778 -4.778 2.222 5.222 12.222
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 169.778 2.722 62.37 4.86e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.167 on 8 degrees of freedom
Is our parameter estimate perfect? Have we completely discovered the average height of UEA students?
The confidence interval and standard error reflect the uncertainty in the estimate.
(There is also uncertainty in the standard deviation estimate, but this is harder to calculate using R and we don’t often use it.)
What can we really say about the mean and spread of student heights?
Is it plausible that the true mean height is 170cm?
Try some other values. 160cm, 163cm?
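The interval itself comes from `confint`, or from `broom::tidy` with `conf.int = TRUE` (a sketch with illustrative data):

```r
dat <- data.frame(height = c(173, 177, 160, 165, 172, 182))  # illustrative
model <- lm(height ~ 1, data = dat)

# 95% confidence interval for the mean height
confint(model)

# Estimate, standard error, p-value and CI in one tidy row
broom::tidy(model, conf.int = TRUE)
```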
2.5 % 97.5 %
(Intercept) 163.5003 176.0552
# A tibble: 1 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 170. 2.72 62.4 4.86e-12 164. 176.
The confidence interval gives us every plausible value of the parameter given the data.
That is, every value not rejected at p<0.05 (for a 95% interval)
How good is our model? What were the model assumptions?
Heights are normally distributed
Individuals’ heights are independent of each other
(Our sample is representative)
How can we check normality?
A qqplot compares model residuals with a normal distribution:
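In base R this is two lines (a sketch with illustrative data):

```r
dat <- data.frame(height = c(173, 177, 160, 165, 172, 182))  # illustrative
model <- lm(height ~ 1, data = dat)

# Residuals vs theoretical normal quantiles; points near the line
# suggest the normality assumption is reasonable
qqnorm(resid(model))
qqline(resid(model))
```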
Statistical inference is trying to estimate something fixed but unknown in a population using observed data in a sample
The model links the unknown parameters to the observed data
Allows us to do prediction and inference
There is a duality between confidence intervals and p-values
We can check normality using qqplots. A visual inspection like this is better than statistical tests for normality that you might see.
A student questionnaire was conducted at Adelaide University. We’ll use this dataset to answer some interesting questions.
Load data into R (from "survey.xlsx")
Estimate a model in R to find the distribution of student heights.
Check whether the heights are normally distributed
Check whether using the mean and sd functions in R gives you the same answer.
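One possible starting point for the exercise (assuming survey.xlsx sits in your working directory and contains a Height column, as in the data previewed later):

```r
library(readxl)

dat <- read_excel("survey.xlsx")

# Intercept-only model for student heights
model <- lm(Height ~ 1, data = dat)
summary(model)

# The direct calculations should agree with the model estimates
mean(dat$Height, na.rm = TRUE)
sd(dat$Height, na.rm = TRUE)
```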
Let’s improve our model.
Remind ourselves of the purpose of a model
We might have a specific hypothesis to test
Or we might want to predict an outcome for a new individual
Or just understand the factors that contribute to our outcome in a population
Can we improve our height prediction?
Can we add a factor to our model that helps to explain height?
Different average for men and women.
We might write this as:
\[ h_i=\left\{\begin{array}{ll} \mu_\text{male} + \epsilon_i & \text{(if male)}\\ \mu_\text{female} + \epsilon_i & \text{(if female)} \end{array}\right.\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2) \]
But we would usually write this as:
\[ \boxed{h_i = \mu + \beta x_i+\epsilon_i \quad\text{where}\quad \left\{\begin{array}{ll} x_i=1 &\text{if male;}\\ x_i=0 & \text{if female} \end{array}\right.} \]
Can you interpret this model?
What are the mean heights for men and women under this model?
What does \(\beta\) represent?
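In R, the dummy coding happens automatically when a factor goes on the right-hand side of the formula (a sketch, assuming the survey data is loaded from survey.xlsx):

```r
library(readxl)

dat <- read_excel("survey.xlsx")

# R creates the 0/1 dummy for Sex itself; "Male" becomes x_i = 1
model <- lm(Height ~ Sex, data = dat)
coef(model)  # (Intercept) = mean female height; SexMale = beta
```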
Female Male
84 84
# A tibble: 6 x 13
...1 Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
<dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 Female 18.5 18 Right R on L 92 Left Some Never 173 Metric
2 2 Male 19.5 20.5 Left R on L 104 Left None Regul 178. Imper~
3 5 Male 20 20 Right Neither 35 Right Some Never 165 Metric
4 6 Female 18 17.7 Right L on R 64 Right Some Never 173. Imper~
5 7 Male 17.7 17.7 Right L on R 83 Right Freq Never 183. Imper~
6 8 Female 17 17.3 Right R on L 74 Right Freq Never 157 Metric
# ... with 1 more variable: Age <dbl>
A graph is a good way to start modelling!
How well does our model fit its assumptions?
# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 166. 0.787 211. 1.80e-203 164. 167.
2 SexMale 13.7 1.11 12.4 2.71e- 25 11.6 15.9
\[ \operatorname{\widehat{Height}} = 165.6 + 13.75(\operatorname{Sex}_{\operatorname{Male}}) \]
Sex emmean SE df lower.CL upper.CL
Female 166 0.787 166 164 167
Male 179 0.787 166 178 181
Confidence level used: 0.95
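Group means like those above come from the emmeans package (a sketch, assuming the Height-on-Sex model refit from the survey data):

```r
library(readxl)
library(emmeans)

dat <- read_excel("survey.xlsx")
model <- lm(Height ~ Sex, data = dat)

# Estimated marginal mean height for each sex, with 95% CIs
emmeans(model, ~ Sex)
```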
A linear model can be used to replace a t-test
(Comparison of a single outcome across two independent groups)
Why should we prefer this to a t-test?
Let’s repeat this process to understand how hand span depends on height.
Can you write down a model for how hand span depends on height? Can you estimate it using R?
Call:
lm(formula = Wr.Hnd ~ Height, data = dat)
Residuals:
Min 1Q Median 3Q Max
-4.8644 -0.8644 0.0230 0.8140 4.8044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83544 1.96131 -1.446 0.15
Height 0.12545 0.01135 11.051 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.461 on 166 degrees of freedom
Multiple R-squared: 0.4238, Adjusted R-squared: 0.4204
F-statistic: 122.1 on 1 and 166 DF, p-value: < 2.2e-16
Interpret and diagnose (test model assumptions) this model
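One convenient diagnostic tool is `check_model` from the performance package loaded at the start (a sketch, refitting the model from the survey data):

```r
library(readxl)
library(performance)

dat <- read_excel("survey.xlsx")
model3 <- lm(Wr.Hnd ~ Height, data = dat)

# Panel of diagnostics: residual normality, homoscedasticity,
# linearity and influential observations
check_model(model3)
```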
Can we better explain the hand width model by adding a second predictor?
Call:
lm(formula = Wr.Hnd ~ Height + Sex, data = dat)
Residuals:
Min 1Q Median 3Q Max
-4.5715 -0.8668 0.0558 0.9587 3.9863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.90837 2.49397 1.567 0.119
Height 0.08281 0.01503 5.509 1.36e-07 ***
SexMale 1.22353 0.29853 4.099 6.51e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.396 on 165 degrees of freedom
Multiple R-squared: 0.4771, Adjusted R-squared: 0.4707
F-statistic: 75.27 on 2 and 165 DF, p-value: < 2.2e-16
Compare the two models. What do they tell you? Which is fitting the data better?
\[ \operatorname{\widehat{Wr.Hnd}} = -2.84 + 0.13(\operatorname{Height}) \]
\[ \operatorname{\widehat{Wr.Hnd}} = 3.91 + 0.08(\operatorname{Height}) + 1.22(\operatorname{Sex}_{\operatorname{Male}}) \]
[1] 1.461155
[1] 1.396227
Is the new model significantly better?
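A formal answer comes from an F-test comparing the nested models with `anova` (a sketch, assuming the two hand-span models are refit from the survey data):

```r
library(readxl)

dat <- read_excel("survey.xlsx")
model3 <- lm(Wr.Hnd ~ Height, data = dat)
model4 <- lm(Wr.Hnd ~ Height + Sex, data = dat)

# F-test: does adding Sex reduce residual variance more than chance alone?
anova(model3, model4)
```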
Finally, we might ask if there’s any evidence that the slope is different between men and women.
How could we make a model to test this hypothesis?
model5 <- lm(data=dat, Wr.Hnd ~ Height + Sex + Height:Sex)
model5 <- lm(data=dat, Wr.Hnd ~ Height * Sex)
extract_eq(model5)

\[ \operatorname{Wr.Hnd} = \alpha + \beta_{1}(\operatorname{Height}) + \beta_{2}(\operatorname{Sex}_{\operatorname{Male}}) + \beta_{3}(\operatorname{Height} \times \operatorname{Sex}_{\operatorname{Male}}) + \epsilon \]
Call:
lm(formula = Wr.Hnd ~ Height * Sex, data = dat)
Residuals:
Min 1Q Median 3Q Max
-4.5776 -0.8609 0.0488 0.9659 4.0437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.58874 4.37585 1.277 0.2033
Height 0.07266 0.02641 2.751 0.0066 **
SexMale -1.33531 5.47715 -0.244 0.8077
Height:SexMale 0.01505 0.03216 0.468 0.6405
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.4 on 164 degrees of freedom
Multiple R-squared: 0.4778, Adjusted R-squared: 0.4682
F-statistic: 50.01 on 3 and 164 DF, p-value: < 2.2e-16
\[ \operatorname{\widehat{Wr.Hnd}} = 3.91 + 0.08(\operatorname{Height}) + 1.22(\operatorname{Sex}_{\operatorname{Male}}) \]
\[ \begin{aligned} \operatorname{\widehat{Wr.Hnd}} &= 5.59 + 0.07(\operatorname{Height})\ - \\ &\quad 1.34(\operatorname{Sex}_{\operatorname{Male}}) + 0.02(\operatorname{Height} \times \operatorname{Sex}_{\operatorname{Male}}) \end{aligned} \]
(ggplot(data=augment(model3), aes(Height, Wr.Hnd)) + geom_point() + geom_line(aes(y=.fitted), lwd=1))+
(ggplot(data=augment(model4), aes(Height, Wr.Hnd, color=Sex)) + geom_point() + geom_line(aes(y=.fitted), lwd=1) +scale_color_manual(values=c("black", "red"))) +
(ggplot(data=augment(model5), aes(Height, Wr.Hnd, color=Sex)) + geom_point() + geom_line(aes(y=.fitted), lwd=1) +scale_color_manual(values=c("black", "red"))) +
plot_layout(guides="collect")

ANOVA compares (1) the improvement in fit between models to (2) the improvement that would be expected if there were no real difference in the population.
The p-values reflect the statistical significance of the improvement as we increase the model complexity.
Analysis of Variance Table
Model 1: Wr.Hnd ~ Height
Model 2: Wr.Hnd ~ Height + Sex
Model 3: Wr.Hnd ~ Height * Sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 166 354.41
2 165 321.66 1 32.747 16.7184 6.777e-05 ***
3 164 321.23 1 0.429 0.2189 0.6405
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is the interaction term needed?
What statistical tests/procedures have we replaced with a linear model?
All other statistical tests can be reframed in this way:
https://lindeloev.github.io/tests-as-linear/#9_teaching_materials_and_a_course_outline
Look at the diversity data in R.
This is data from a real experiment to explore gut microbial diversity over time in hospital patients.
We’ll use linear models to test whether diversity changes over time.
So far we’ve seen data that can be modelled using normal distributions, where the relationships are genuinely straight lines.
Biological data very often does not act like this
What options do we have?
Funkier models?
Transformations?
The body mass of animals is linked to their metabolic rate.
Can we fit a linear model to this?
The same data plotted on a different scale:
Instead of modelling how \(y\) depends on \(x\), we can model how \(\log(y)\) depends on \(\log(x)\).
\[\log(y_i) = a + b\log(x_i) + \epsilon_i\]
\[y_i = \exp(a + b\log(x_i) + \epsilon_i) = e^a\times x_i^{b}\times e^{\epsilon_i}\]
So instead of an additive model, we have a multiplicative one.
But we estimate using R as a regular linear model, using \(\log(y)\) as the outcome and \(\log(x)\) as the predictor.
modelAnimals <- lm(data=animals , log(metabolic.rate) ~ log(body.mass)*phylum)
summary(modelAnimals)
Call:
lm(formula = log(metabolic.rate) ~ log(body.mass) * phylum, data = animals)
Residuals:
Min 1Q Median 3Q Max
-4.8236 -0.4098 0.0867 0.5732 2.4220
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.55146 0.15437 -3.572 0.000368 ***
log(body.mass) 0.86224 0.01496 57.641 < 2e-16 ***
phylumChordata 1.45940 0.16753 8.711 < 2e-16 ***
log(body.mass):phylumChordata -0.25689 0.02493 -10.305 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9566 on 1180 degrees of freedom
(2394 observations deleted due to missingness)
Multiple R-squared: 0.9588, Adjusted R-squared: 0.9587
F-statistic: 9161 on 3 and 1180 DF, p-value: < 2.2e-16
animals$predicted <- predict(modelAnimals, newdata = animals, type="response")
augment(modelAnimals, interval="prediction") |>
ggplot(aes(x=exp(`log(body.mass)`), y=exp(`log(metabolic.rate)`), col=phylum, fill=phylum)) +
geom_point() +
geom_line(aes(y=exp(.fitted))) +
geom_ribbon(aes(y=exp(.fitted),ymin=exp(.lower),ymax=exp(.upper)),alpha=0.2) +
facet_wrap(~phylum, scale="free")

Linear models can be useful for data that does not look linear!
Next time:
Binary data (logistic regression)
Multilevel models (more complex errors)